[MUSIC]
This lecture is about the expectation
maximization algorithms or
also called the EM algorithms.
In this lecture,
we're going to continue the discussion
of probabilistic topic models.
In particular,
we're going to introduce the EM algorithm.
Which is a family of useful algorithms for
computing the maximum life or
estimate of mixture models.
So, this is now a familiar scenario
of using two components, the mixture
model to try to fact out the background
words from one topic or word distribution.
Yeah.
So, we're interested in computing
this estimate and
we're going to try to adjust these
probability values to maximize
the probability of the observed documents.
And know that we're assumed all
the other parameters are known.
So, the only thing unknown is these water
properties, this given by zero something.
And in this lecture, we're going to look
into how to compute this maximum like or
estimate.
Now this started with the idea of
separating the words in
the text data into two groups.
One group will be explained
by the background model.
The other group will be explained
by the unknown topical order.
After all this is the basic
idea of the mixture model.
But, suppose we actually know which
word is from which distribution.
So that would mean, for example,
these words, the, is, and
we, are known to be from this
background origin, distribution.
On the other hand,
the other words, text mining,
clustering, etcetera are known to be
from the topic word, distribution.
If you can see the color,
that these are showing blue.
These blue words are, they are assumed
to be from the topic word, distribution.
If we already know how
to separate these words.
Then the problem of estimating
the word distribution
would be extremely simple, right?
If you think about this for
a moment, you'll realize that, well,
we can simply take all these
words that are known to be from
this word distribution,
see that's a d and normalize them.
So indeed this problem would be
very easy to solve if we had known
which words are from which
it is written precisely.
And this is in fact,
making this model no longer a mystery
model because we can already observe which
of these distribution has been used
to generate which part of the data.
So we, actually go back to the single
order distribution problem.
And in this case, let's call these words
that are known to be from theta d,
a pseudo document of d prime.
And now all we have to do is
just normalize these word
accounts for each word, w sub i.
And that's fairly straightforward,
and it's just dictated by
the maximum estimator.
Now, this idea, however,
doesn't work because we in practice,
don't really know which word
is from which distribution.
But this gives us an idea of perhaps
we can guess which word is
from which distribution.
Specifically, given all the parameters,
can we infer the distribution
a word is from?
So let's assume that we actually
know tentative probabilities for
these words in theta sub d.
So now all the parameters are known for
this mystery model.
Now let's consider word, like a text.
So the question is,
do you think text is more likely,
having been generated from theta sub d or
from theta sub b?
So, in other words,
we are to infer which distribution
has been used to generate this text.
Now, this inference process is a typical
of basing an inference situation,
where we have some prior about
these two distributions.
So can you see what is our prior here?
Well, the prior here is the probability
of each distribution, right.
So the prior is given by
these two probabilities.
In this case, the prior is saying
that each model is equally likely.
But we can imagine perhaps
a different apply is possible.
So this is called a pry
because this is our guess
of which distribution has been
used to generate the word.
Before we even observed the word.
So that's why we call it a pry.
If we don't observe the word we don't
know what word has been observed.
Our best guess is to say,
well, they're equally likely.
So it's just like flipping a coin.
Now in basic inference,
we typical them with our belief
after we have observed the evidence.
So what is the evidence here?
Well, the evidence here is the word text.
Now that we know we're
interested in the word text.
So text can be regarded as evidence.
And if we use base
rule to combine the prior and
the theta likelihood,
what we will end up with
is to combine the prior
with the likelihood that you see here.
Which is basically the probability of
the word text from each distribution.
And we see that in both
cases text is possible.
Note that even in the background
it is still possible,
it just has a very small probability.
So intuitively what would be
your guess seeing this case?
Now if you're like many others,
you would guess text is probably
from c.subd it's more likely from c.subd,
why?
And you will probably see
that it's because text has
a much higher probability
here by the C now sub D than
by the background model which
has a very small probability.
And by this we're going to say well,
text is more likely from theta sub d.
So you see our guess of which
distributing has been used with
the generated text would depend on
how high the probability of the data,
the text, is in each word distribution.
We can do tentative guess that
distribution that gives is a word
higher probability.
And this is likely to
maximize the likelihood.
All right, so we are going to choose
a word that has a higher likelihood.
So, in other words we are going to
compare these two probabilities
of the word given by each
of these distributions.
But our guess must also
be affected by the prior.
So we also need to
compare these two priors.
Why?
Because imagine if we
adjust these probabilities.
We're going to say,
the probability of choosing
a background model is almost 100%.
Now if we have that kind of strong prior,
then that would affect your gas.
You might think,
well, wait a moment, maybe texter could
have been from the background as well.
Although the probability is very
small here the prior is very high.
So in the end, we have to combine the two.
And the base formula
provides us a solid and
principle way of making this
kind of guess to quantify that.
So more specifically, let's think about
the probability that this word text
has been generated in
fact from theta sub d.
Well, in order for text to be generated
from theta sub d, two things must happen.
First, the theta sub d
must have been selected.
So, we have the selection
probability here.
And secondly we also have to actually have
observed the text from the distribution.
So, when we multiply the two together,
we get the probability
that text has in fact been
generated from zero sub d.
Similarly, for the background model and
the probability of generating text
is another product of similar form.
Now we also introduced late in
the variable z here to denote
whether the word is from the background or
the topic.
When z is 0, it means it's from the topic,
theta sub d.
When it's 1, it means it's from
the background, theta sub B.
So now we have the probability
that text is generated from each,
then we can simply normalize
them to have estimate
of the probability that
the word text is from
theta sub d or from theta sub B.
And equivalently the probability
that Z is equal to zero,
given that the observed evidence is text.
So this is application of base rule.
But this step is very crucial for
understanding the EM hours.
Because if we can do this,
then we would be able to first,
initialize the parameter
values somewhat randomly.
And then, we're going to take
a guess of these Z values and
all, which distributing has been
used to generate which word.
And the initialize the parameter values
would allow us to have a complete
specification of the mixture model,
which allows us to apply Bayes'
rule to infer which distribution is
more likely to generate each word.
And this prediction essentially helped us
to separate words from
the two distributions.
Although we can't separate them for sure,
but we can separate then
probabilistically as shown here.
[MUSIC]

